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We present a new class of methods for high-dimensional non- 
parametric regression and classification called sparse additive models 
(SpAM). Our methods combine ideas from sparse linear modeling 
and additive nonparametric regression. We derive an algorithm for 
fitting the models that is practical and effective even when the num- 
ber of covariates is larger than the sample size. SpAM is essentially 
a functional version of the grouped lasso of Yuan and Lin (2006). 
SpAM is also closely related to the COSSO model of iLin and Zhan3 
(|2006l ). but decouples smoothing and sparsity, enabling the use of 
arbitrary nonparametric smoothers. We give an analysis of the the- 
oretical properties of sparse additive models, and present empirical 
results on synthetic and real data, showing that SpAM can be ef- 
fective in fitting sparse nonparametric models in high dimensional 
data. 



1. Introduction. Substantial progress has been made recently on the 
problem of fitting high dimensional linear regression models of the form 
Yi = Xj (5 + ej, for i = Here Yi is a real- valued response, Xi 

is a predictor and is a mean zero error term. Finding an estimate of 
(3 when p > n that is both statistically well-behaved and computation- 
ally efficient has proved challenging; however , under the assum ption that 
the vector /? is sparse, the lasso estimator ( Tibshirani ( 19961 )) has been 
remarkably successful. The lasso estimator (3 minimizes the ^i-penalized 
sum of squares ~ + ^YTj=i with the £i penalty \\f3\\i en- 

couraging sparse solutions, where many components [3j are zero. The good 
empirical success of this estimator has been recently b acked up by r esults 



confirming that it has strong theoretical properties ; see (iBunea et al 



Wainwrightl . 



2007 : 



20061 : 



Greenshtein and Ritovl . 120041 : iMeinshausen and Yul . 1200' 
Zhao and Yul . l2007l ). 

The nonparametric regression model Yi = m{Xi)+ei, where m is a general 
smooth function, relaxes the strong assumptions made by a linear model, but 
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is much more challenging in high dimensions. iHastie and Tibshirani (Il999l ^ 



introduced the class of additive models of the form 
(1) Yi = j2f,{X,,)+ei. 

This additive combination of univariate functions — one for each covariate 
Xj — is less general than joint multivariate nonparametric models, but can 
be more interpretable and easier to fit; in particular, an additive model 
can be estimated using a coordinate descent Gauss-Seidel procedure, called 
backfitting. Unfortunately, additive models only have good statistical and 
computational behavior when the number of variables p is not large relative 
to the sample size n, so their usefulness is limited in the high dimensional 
setting. 

In this paper we investigate sparse additive models (SpAM), which ex- 
tend the advantages of sparse linear models to the additive, nonparametric 
setting. The underlying model is the same as in ([1]), but we impose a spar- 
sity constraint on the index se t j? : /■,■ ^ 0} of functions fj that are not 
identically zero. iLin and Zhaii3 tood ) have proposed COSSO, an extension 



of lasso to this setting, for the case where the component functions fj be- 
long to a reproducing kernel Hilbert space (RKH S). Th e y pen alize the sum 



of the RKHS norms of the component functions. lYuanl ()2007l ) proposed an 
extension of the non-negative garrote to this setting. As with the paramet- 
ric non-negative garrote, the success of this method depends on the initial 
estimates of component functions fj. 

In Section [3l we formulate an optimization problem in the population 
setting that induces sparsity. Then we derive a sample version of the solution. 
The SpAM estimation procedure we introduce allows the use of arbitrary 
nonparametric smoothing techniques, effectively resulting in a combination 
of the lasso and backfitting. The algorithm extends to classification problems 
using generalized additive models. As we explain lat er, SpAM can also b e 



thought of as a functional version of the grouped lasso ( Yuan and Lin . 20061 ) . 



The main results of this paper include the formulation of a convex opti- 
mization problem for estimating a sparse additive model, an efficient backfit- 
ting algorithm for constructing the estimator, and theoretical results that an- 
alyze the effectiveness of the estimator in the high dimensional setting. Our 
theoretical results are of several different types. First, we show that, under 
suitable choices of the design parameters, the SpAM backfitting algorithm 
recovers the correct sparsity pattern asymptotically; this is a property we call 
sparsistence, as a shorthand for "sparsity pattern cons istency." Second, we 



show that that the estimator is persistent, in the sense of lGreenshtein and Ritov 
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(12004), which is a form of risk consistency. Specificahy, we show: 

Sparsistence: P (s = 5) ^ 1 if = 0(e"*), for ^ < f 

Persistence: R{mn) — inf R{h) — > if p„ = 0{e'^ ), for ^ < 1. 

h&Mn 

Here S = {j : fj ^ 0} is the index set for the nonzero components, 
S = {j ■ fj / 0} and A^^ is a class of functions defined by the level of 
regularization. 

In the following section we establish notation and assumptions. In Sec- 
tion [3] we formulate SpAM as an optimization problem and derive a scal- 
able backfitting algorithm. An extension to sparse nonparametric logistic 
regression is presented in Section HI Examples showing the use of our sparse 
backfitting estimator on high dimensional data are included in Section [6l In 
Section [7.11 we formulate the sparsistency result, when orthogonal function 
regression is used for smoothing. In Section 17.21 we give the persistence re- 
sult. Section [8] contains a discussion of the results and possible extensions. 
Proofs are contained in Section [H 

2. Notation and Assumptions. We assume that we are given data 
{Xi,Yi),..., {Xn, Yn) where X, = (X^, . . . , X,j, . . . , X,pf G [0, 1]^ and 

(2) Y^ = m{Xi) + a 
with ei ~ 7V(0, 0-2) and 

p 

(3) m{x) =^fj{xj). 

j=i 

Denote the joint distribution of {Xi,Yi) by P. For a function / on [0,1] 
denote its L2{P) norm by 




For j G {l,...,p}, let TLj denote the Hilbert subspace of L2{P), of 
P— measurable functions fj{xj) of the single scalar variable Xj with zero 
mean, E(/j(Xj)) = 0. Thus, TLj has the inner product 



(5) 



(/„/i)=E(/,(X,)/j(X,)) 
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and WfjW = ^jE{fj{Xj)^) < oo. Let W = © © • • • © denote the 
Hilbert space of functions of {xi, . . . , Xp) that have the additive form: m{x) = 
Ej fji^j), with fj e HjJ = l,...,p. 

Let {ipjk,k = 0,1,...} denote a uniformly bounded, orthonormal basis 
with respect to Lebesgue measure on [0, 1] . Unless stated otherwise, we as- 
sume that fj E Tj where 



00 



(6) T, = /, G n, : fj{x,) = Y: Pjki^jk{xjl f^hf"^ < 

I k=0 k=0 ) 

for some < C < 00. We shall take Vj = 2 although the extension to other 
levels of smoothness is straightforward. It is also possible to adapt to z^j 
although we do not pursue that direction here. 

Let Amin(^) and Amax(^) denote the minimum and maximum eigenvalues 
of a square matrix A. If v = {vi, . . . ,Vk)'^ is a vector, we use the norms 



(7) 



. E^l' ibiii = E 



3. Sparse Backfitting. The outline of the derivation of our algorithm 
is as follows. We first formulate a population level optimization problem, and 
show that the minimizing functions can be obtained by iterating through a 
series of soft-thresholded univariate conditional expectations. We then plug 
in smoothed estimates of these univariate conditional expectations, to derive 
our sparse backfitting algorithm. 

Population SpAM. For simplicity, assume that E(yi) = 0. The standard 
additive model optimization problem in L2{P) (the population setting) is 

(8) ,.J"?<< lE(y-E?=i/,(^,))' 
fjeHj,i<3<p \ -J I 

where the expectation is taken with respect to X and the noise e. Now 
consider the following modification of this problem that introduces a scaling 
parameter for each function, and that imposes additional constraints: 

(9) ...^Pif,^^, iE(y-E^=i%,(^,);' 



V 

(10) subject to: ^ < 

(11) e(5|) = 1, i = l,...,p. 
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noting that gj is a function while (3 = . . . is a vector. The con- 

straint that P lies in the ^i-bah {/? : < L} encourages sp arsity of the 
estimated /?, just as for the parametric lasso ( Tibshirani . 19961 ) . 

It is convenient to re-express the minimization in the following equivalent 
form: 



(12) 
(13) 



min E(Y-j:%iMXj 



subject to: ^ y'E(/2(X,)) < L. 



The optimization problem in ()12p can also be written in the penalized 
Lagrangian form, 



(14) 



(15) 



Theorem 1. The minimizers fj G Tij of ( f j^[ ) satisfy 

A 



fi 



1 



a.s. 



where [•]+ denotes the positive part, and Pj = E[Rj \ Xj] denotes the projec- 
tion of the residual Rj =Y — J2kj^j fk{Xk) onto 7ij. 

The proof is given in Section O 

At the population level, the fjS can be found by a coordinate descent 
procedure that fixes (/^ : k ^ j) and fits /-,• by equation (jlSp . then iterates 
over j. 



Data version of SpAM. To obtain a sample version of the population 
solution, we insert sa mple estimates into the pop ulation algorithm, as in 
standard backfitting ( Hastie and Tibshiranil . Il999l ). Thus, we estimate the 



projection Pj 
(16) 



E(i?j I Xj) by smoothing the residuals: 



P. 



SjRj 



where Sj is a linear smoother, such as a local linear or kernel smoother. Let 
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SpAM Backfitting Algorithm 



Input: Data regularization parameter A. 

Initialize fj = 0, for j = 1, . . . ,p. 
Iterate until convergence: 

For each j = 1, . . . ,p: 

(1) Compute the residual: Rj = Y — J2k^j fki^k)', 

(2) Estimate Pj = E[Rj \ Xj] by smoothing: Pj = SjRj; 

(3) Estimate norm: s| = ^ Yli=i ^/(O; 

(4) Soft-threshold: fj = [1 - X/sj]^Pj; 

(5) Center: fj <— fj — mean(/j). 

Output: Component functions fj and estimator •m{Xi) = J2j fji-^ij)- 

Fig 1. The SpAM backfitting algorithm. The first two steps in the iterative algorithm are 
the usual backfitting procedure; the remaining steps carry out functional soft thresholding. 



be the estimate of yE(P?). Using these plug-in estimates in the coordinate 
descent procedure yields the SpAM backfitting algorithm given in Figure [TJ 
This algorithm can be seen as a functional version of the coordinate de- 
scent algorithm for solving the lasso. In particular, if we solve the lasso by 
iteratively minimizing with respect to a single coordinate, each iteration is 
given by soft thresholding; see Figured Convergence pro perties of variants 
of thi s simp le algorithm have been recently treated by iDaubechies et al. 
(|2004l . 120071 ) . Our sparse backfitting algorithm is a direct generalization of 
this algorithm, and it reduces to it in case where the smoothers are local 
linear smoothers with large bandwidths. 

Basis Functions. It is useful to express the model in terms of basis func- 
tions. Recall that Bj = {ipjk '■ /c = 1, 2, . . .) is an orthonormal basis for Tj 
and that sup^ \ '^jk{^)\ ^ B for some B. Then 

oo 

(18) fj{xj) = ^ Pjki^jkixj) 

k=l 

where f3jk = J fj{xj)tpjk{xj)dxj. 
Let us also define 

d 

(19) fj{xj) = Pjki^jk{xj) 

k=l 
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Coordinate Descent Lasso 



Input: Data {Xi,Yi), regularization parameter A. 
Initialize f3j = 0, for j = 1, . . . ,p. 
Iterate until convergence: 

For each j = 1, . . . ,p: 

(1) Compute the residual: Rj = Y — J2k^j Pk^k] 



(2) Project residual onto Xj: Pj = Xj Rj 

(3) Soft-threshold: ]3j = [1 - X/\Pj\]^Pj; 
Output: Estimator fh{Xi) = f^j^ij- 



Fig 2. The SpAM backfitting algorithm is a functional version of the coordinate descent 
algorithm for the lasso, which computes (5 — argmin ^\\Y — X/jyl + A||/3|li. 



where d = is a truncation parameter. For the Sobolev space Tj of order 
two we have that /, — /, = 0{l/d). Let S = {j : fj / 0}. Assum- 



S"! = 0(1) it follows that 



\m 



n 



1/5 



yielding truncation bias 



ing the sparsity condition 
where rh = J2j fj ■ The usual choice is d 
\\m — m|p = 0(n~^/^). 

In this setting, the smoother can be taken to be the least squares pro- 
jection onto the truncated set of basis functions {ipji, • • • i '4'jd}'j this is also 
called orthogonal series smoothing. Let denote the n x dn matrix given 
by = 'ip£{Xij). The smoothing matrix is the projection matrix Sj = 

^'j(^'J^'j)"^^'J. In this case, the backfitting algorithm in Figured] is exactly 
the coordinate descent algorithm for minimizing 



(20) 



1 

2n 



which is the sample version of (jl4|) . In Section 17.11 we prove theoretical 
properties assuming that this particular smoother is being used. 



Connection with the Grouped Lasso. The Sp AM model can be t hought 
of as a functional version of the grouped lasso ( Yuan and Lin . 20061 ) as we 
now explain. Consider the following linear regression model with multiple 
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factors, 



(21) 



where y is an n x 1 response vector, e is an n x 1 vector of iid mean zero 
noise, Xj is an n x dj matrix corresponding to the j-th factor, and f3j is 
the corresponding dj x 1 coefhcient vector. Assume for convenience (in this 
subsection only) that each Xj is orthogonal, so that Xj Xj = 1^^ , where Idj 
identity matrix. We use X = (Xi, . . . , Xp^) to denote the fuh 



is the dj X dj 



design matrix and use [3 = {Pf , . . . , Pp^Y to denote the parameter. 

The grouped lasso estimator is defined as the solution of the following 
convex optimization problem: 



Pn 



(22) 



/3(An) 



arg min \ \Y 



X(3\\l + \nY. 



where 



scales the jth term to compensate for different group sizes. 



It is obvious that when dj = 1 for j = 1, . . . , the grouped lasso becomes 
the standard lasso. From the KKT optimality conditions, a necessary and 
sufficient condition for /3 = (/3^, . . . ,/3j)"^ to be the grouped lasso solution 
is 



(23) 



- Xj{Y - Xf3) + 



yp, + 0, 



X]{J-Xp)\\ < AV^, V/3, 



0. 



Based on this stationary condition, an it erative blockwi s e coo rdinate descent 
algorithm can be derived; as shown by Yuan and Lin ( 20061 ). a solution to 
([25]) satisfies 



(24) 



P, 



\\SA\ 



where = Xj{Y - XP\j), with = (/?f , . . . , pj_^, 



0^,P^ 



'j+V 



iteratively applying (p^ . the grouped lasso solution can be obtained. 

As discussed in the introduction, the COSSO model of iLin and Zhang 
( 20061 ) replaces the lasso constraint on J2j \Pj\ with a RKHS constraint. The 
advantage of our formulation is that it decouples smoothness (gj G Tj) and 
sparsity {J2j \Pj\ — This leads to a simple algorithm that can be carried 
out with any nonparametric smoother and scales easily to high dimensions. 
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4. Sparse Nonpar ametric Logistic Regression. The SpAM back- 
fitting procedure can be extended to nonparametric logistic regression for 
classification. The additive logistic model is 

(25) F{Y = l\X)^p{X;f)- \ ^ ' ' 



l + exp(E^^i/,(Xj 

where Y G {0, 1}, and the population log-likelihood is 

(26) £(/) = E [Yf{X) - log (1 + exp f{X))] . 

Recall that in the loca l scor ing algorithm for generalized additive models 
( Hastie and Tibshirani . 19991 ) in the logistic case, one runs the backfitting 



procedure within Newton's method. Here one iteratively computes the trans- 
formed response for the current estimate /o 

(27) Z, = UX,)+ Y,-p{XfJo) 



p{Xi;fo){l-p{Xi-Jo)) 

and weights w{Xi) = p(Xj;/o)(l — p{Xi; /q), and carries out a weighted 
backfitting of {Z, X) with weights w. The weighted smooth is given by 

(28) P, = 

To incorporate the sparsity penalty, we first note that the Lagrangian is 
given by 



(29) £(/, A) = E [log (l + e/(^)) - Yf{X)\ + X \^ ^njfiXj)) - L 

and the stationary condition for component function fj is E, {p — Y \ Xj) -\- 
Xvj = where Vj is an element of the subgradient d^K{fj). As in the un- 
regularized case, this condition is nonlinear in /, and so we linearize the 
gradient of the log- likelihood around /q. This yields the linearized condi- 
tion E [w{X){f{X) - Z) I Xj] + Xvj = 0. When E{fJ) / 0, this implies the 
condition 

(30) \Eiw\Xj) + -j== I f, {X, ) = E{wR, I Xj , 
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In the finite sample case, in terms of tlie smootliing matrix Sj, this becomes 

Sj {wRj ) 



(31) 



SjW + A 



If ||5j(wi?j)|| < A, then fj = 0. Otherwise, this imphcit, nonhnear equation 
for fj cannot be solved explicitly, so we propose to iterate until convergence: 



(32) 



Sj {wRj ] 



SjW + A^/||/,-|| ■ 



When A = 0, this yields the standard local scoring update (p8]) . An example 
of logistic SpAM is given in Section [H 

5. Choosing the Regular izat ion Parameter. We choose A by min- 
imizing an estimate of the risk. Let Vj be the effective degrees of freedom 
for the smoother on the j^^ variable, that is, Vj = trace(6'j) where Sj is the 
smoothing matrix for the j-th dimension. Also let ct^ be an estimate of the 
variance. Define the total effective degrees of freedom as 

(33) df(A) = ^z.,-/(||/,||/0). 

3 

Two estimates of risk are 
(34) 



Cp = -Eh^^-E/.(^.) +^df(A) 
i=l \ j=l I 



and 



(35) 



GCV(A) 



^Er=i(y.-E,/;(^.,))^ 

(l-df(A)/n)2 



The first is Cp and the second is generalized cross validation but with degrees 
of freedom defined by df(A). A proof that these are valid estimates of risk is 
not currently available; t hus, these should be rega r ded a s heuristics. 

Based on the results in lWasserman and Roederl (120071 ) about the lasso, it 
seems likely that choosing A by risk estimation can lead to overfitting. One 
can further clean the estimate by testing Hq : fj = for all j such that 
fj 7^ 0. For example, the tests in iFan and Jia^ (|2005l ') could be used. 
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6. Examples. To illustrate the method, we consider a few examples. 
Synthetic Data. Our first example is from (|Hardle et al.l . \2Q0i ) . We gen- 



erated n = 150 observations from the following 200-dimensional additive 
model: 

(36) Yi = fl{xii) + f2{xi2) + fsiXis) + h{Xi4,) + €i 

h{x) = -2sin(2x), h{x) = x^-\, h{x) =^-\^ U^) = e"" + e"^ - 1 

and fj{x) = for j > 5 with noise ~ AA(0, 1). These data therefore have 
196 irrelevant dimensions. 

The results of applying SpAM with the plug-in bandwidths are summa- 
rized in Figure El The top-left plot in Figure [3] shows regularization paths 
as the parameter A varies; each curve is a plot of ||/j(A)|| versus 

(37) ELill/fc(A)ll 



max,ELill/fc(A)ll 



for a particular variable Xj. The estimates are generated efficiently over 
a sequence of A values by "warm starting" fj{\t) at the previous value 
fj{Xt-i)- The top-center plot shows the Cp statistic as a function of A. The 
top-right plot compares the empirical probability of correctly selecting the 
true four variables as a function of sample size n, for p = 128 and p = 256. 
This behavi or suggests t h e sam e threshold phenomenon that was shown for 
the lasso by Wainwright ( 20061 ). 



Boston Housing. The Boston housing data were collected to study house 
values in the suburbs of Boston. There are 506 observations with 10 covari- 



ates. The dataset has been studied by many other authors (IHardle et al 



2004 : Lin and Zhanel - bood l. with various transformations proposed for dif- 



ferent covariates. To explore the sparsistency properties of our method, we 
add 20 irrelevant variables. Ten of them are randomly drawn from Uniform(0, 1), 
the remaining ten are a random permutation of the original ten covariates. 
The model is 

Y = a /i(crim) + /2(indus) + /3(nox) /4(rm) + /5(age) 
(38) + /6(dis) + /7(tax) + /8(ptratio) + /9(b) + /lo(lstat) + e. 

The result of applying SpAM to this 30 dimensional dataset is shown 
in Figure [H SpAM identifies 6 nonzero components. It correctly zeros out 
both types of irrelevant variables. From the full solution path, the important 



12 



RAVIKUMAR, LAFFERTY, LIU, AND WASSERMAN 




0.0 0.2 0.4 0.6 O.B 1.0 



0.0 0.2 0.4 0.6 0.8 1.0 




102030405060708090 110 130 150 

sample size 



11=97.05 




0.0 0.2 



0.4 0.6 
Xl 



1.0 



11=88.36 







CD 








OJ- 




CM 




CM 




CO 


E- 




E 








1 " 




1 








1 " 




1 






CD 

1 



0.0 0.2 0.4 0.6 O.i 
x2 



1.0 



11=90.65 



0.0 0.2 0.4 0.6 0.8 1.0 
x3 



11=79.26 



0.0 0.2 



0.4 0.6 
x4 



0.8 1.0 



E 



zero 



0.0 0.2 



0.4^^.6 



0.8 1.0 



zero 



0.0 0.2 



0.4 n.6 
x6 



0.8 1.0 



Fig 3. (Simulated data) Upper left: The empirical £2 norm of the estimated components as 
plotted against the regularization parameter \; the value on the x-axis is proportional to 
X/j Upper center: The Cp scores against the regularization parameter A; the dashed 

vertical line corresponds to the value of A which has the smallest Cp score. Upper right: 
The proportion of 200 trials where the correct relevant variables are selected, as a function 
of sample size n. Lower (from left to right): Estimated (solid lines) versus true additive 
component functions (dashed lines) for the first 6 dimensions; the remaining components 
are zero. 
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Fig 4. (Boston housing) Left: The empirical £2 norm of the estimated components versus 
the regulanzation parameter X. Center: The Cp scores against X; the dashed vertical line 
corresponds to best Cp score. Right: Additive fits for four relevant variables. 
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variables are seen to be rm, Istat, ptratio, and crim. The importance of 
variables nox and b is borderline. These results are b asically consistent with 
those obtained by other authors ( Hardle et al. . 20041 ) . However, using Cp as 
the selection criterion, the variables indux, age, dist, and tax are estimated 
to be irrelevant, a result not seen in other studies. 



SpAM for Spam. Here we consider an email spam classification problem, 
using the logistic Sp AM backfitting algo rithm from Section [3l This dataset 
has been studied by iHastie et al.l (120011 ). using a set of 3,065 emails as a 
training set, and conducting hypothesis tests to choose significant variables; 
there are a total of 4,601 observations with p = 57 attributes, all numeric. 
The attributes measure the percentage of specific words or characters in 
the email, the average and maximum run lengths of upper case letters, and 
the total number of such letters. To demonstrate how SpAM performs with 
sparse data, we only sample n = 300 emails as the training set, with the 
remaining 4301 data points used as the test set. We also use the test data 
as the hold-out set to tune the penalization parameter A. 

The results of a typical run of logistic SpAM are summarized in Figure [5l 
using plug- in bandwidths. It is interesting to note that even with this rel- 
atively small sample size, logistic SpAM recovers a sparsity pattern that is 
consistent with previous authors' results. For example, in the best model 
chosen by logistic SpAM, according to error rate, the 33 selected variable s 
cover 80% of the significant predictors as determined bv lHastie et al.l (j200ll ). 



Functional Sparse Coding. lOlshausen and Fieidl kmd ) propose a method 
of obtaining sparse representations of data such as natural images; the mo- 
tivation comes from trying to understand principles of neural coding. In this 
example we suggest a nonparametric form of sparse coding. 

Let =i,...,N be the data to be represented with respect to some 

learned basis, where each instance 2/^*^ G is an ra-dimensional vector. The 
linear sparse coding optimization problem is 



TV 



(39) 
(40) 



mm 

I3,X 

such that 



5Z 1 Or, 



1 



2n 
< 1 



+ A 



/3« 



Here X is an n x p matrix with columns Xj, representing the "dictionary" 
entries or basis vectors to be learned. It is not required that the basis vectors 
are orthogonal. The ii penalty on the coefficients /J^*^ encourages sparsity, so 
that each data vector is represented by only a small number of dictionary 
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Fig 5. (Email spam) Classification accuracies and variable selection for logistic SpAM. 



elements. Sparsity allows the features to specialize, and to capture salient 
properties of the data. 

This optimization problem is not jointly convex in and X. However, 
for fixed X, each weight vector P^^^ is computed by running the lasso. For 
fixed the optimization is similar to ridge regression, and can be solved 
efficiently. Thus, an iterative procedure for (approximately) solving this op- 
timization problem is easy to derive. 

In t he case of sparse coding of natural images, as in lOlshausen and Field 
( 19961 ) ■ the basis vectors Xj encode basic edge features. A code with 200 
basis vectors, estimated by carrying out the optimization using the lasso 
and stochastic gradient descent, is shown in Figure [6l The codewords are 
seen to capture edge features at different scales and spatial orientations. 

In the functional version, we no longer assume a linear, parametric fit 
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Fig 6. A sparse code with 200 codewords, trained on 200,000 patches from natural images, 
each patch 16 x 16 pixels in size, using the lasso and stochastic gradient descent. The 
codewords are seen to capture edge features at different scales and spatial orientations. 



between the dictionary X and the data y. Instead, we model the relationship 
using an additive model: 



(41) 



where Xj G MP is a dictionary vector and e*^*) G is a noise vector. This 
leads to the following optimization problem for functional sparse coding: 



mm 



(42) 

(43) such that 



Figures [7] and [8] illustrate the reconstruction of different image patches 
using the sparse linear model compared with the sparse additive model. The 
codewords Xj are those obtained using the Olshausen-Field procedure; these 
become the design points in the regression estimators. Thus, a codeword for 
a 16 X 16 patch corresponds to a vector Xj of dimension 256, with each Xij 
the gray level for a particular pixel. 

It can be seen how the functional fit achieves a more accurate approxi- 
mation to the image patch, with fewer codewords. Each set of plots shows 
the original image patch, the reconstruction, and the codewords that were 
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Lasso 

Original patch RSS = 0.0561 






■ MR 



,5.-, 
in 







SpAM 

Original patch RSS = 0.0206 












Fig 7. Comparison of sparse reconstruction using the lasso (left) and SpAM (right). 



used in the reconstruction, together with their marginal fits to the data. 
For instance, in Figure [7| it is seen that the sparse Unear model uses eight 
codewords, with a residual sum of squares (RSS) of 0.0561, while the sparse 
additive model uses seven codewords to achieve a residual sum of squares 
of 0.0206. It can also be seen that the linear and nonlinear fits use different 
sets of codewords. Local linear smoothing was used with a Gaussian kernel 
having fixed bandwidth h = 0.05 for all patches and all codewords. 

These results are obtained using the set of codewords obtained under the 
sparse linear model. The codewords can also be learned using the sparse 
additive model; this will be reported in a separate paper. 

7. Theoretical Properties. 

7.1. Sparsistency. In the case of linear regression, with fj{Xj) = Pj'^Xj, 
several authors have shown that, under certain conditions on n, p, the num- 
ber of relevant variables s = |supp(/3*)|, and the design matrix X, the lasso 
recovers the sparsity pattern asymptotically; that is, the lasso estimator 
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Fig 8. Comparison of sparse reconstruction using the lasso (left) and SpAM (right). 
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Lasso reconstruction 



SpAM reconstruction 





Codewords/patch 8.14, RSS 0.1894 Codewords/patch 8.08, RSS 0.0913 



Fig 9. Image reconstruction using the lasso (left) and SpAM (right). The regularization 
parameters were set so that the number of codewords used in each reconstruction was 
approximately equal. To achieve the same residual sum of squares (RSS), the linear model 
requires an average of more than 26 codewords per patch. 



is sparsistent: 
(44) 



P(supp(r) = supp(^. 



Here, s upp(/ 3 ) = I? : 0i 0|. Referen c es inc lude Wainwright ( 200^ ). Meinshausen and P. Biihlmann 
(|2006l '). [Zoul pOOSl ^ and Izhao and Yul (|2007l l. We show a similar result for 
sparse additive models under orthogonal function regression. 
In terms of an orthogonal basis ip, we can write 



(45) 



=1 k=l 



To simplify notation, let Pj be the d„ dimensional vector {(3jk, k = 
l,...,dn} and let be the n x dn matrix ^j[i,k] = ipjkiXij). If ^ C 
{1, . . . ,p}, we denote by the n x d\A\ matrix where for each j G A, 
appears as a submatrix in the natural way. 

We now analyze the sparse backfitting Algorithm ([T]) assuming an or- 
thogonal series smoother is used to estimate the conditional expectation in 
its Step (2). As noted earlier, an orthogonal series smoother for a predictor 
Xj is the least squares projection onto a truncated set of basis functions 
{tpji, . . . , ipjd}- Combined with the soft-thresholding step, the update for fj 
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in Algorithm ([T]) can thus be seen to solve the following problem, 
1 



mm 



p 2n 



1 



n 



where ||f Hi denotes ^27=1 '^f ~ ^ ~ ^i^j ^il^i the residual for fj. 

The sparse backfitting algorithm can then be seen to solve, 



mm{i?„(/3) + A„f)(/?)} 



(46) 



(47) 



. 1 

mm — 

(5 2n 



mm — 

f3 2n 



where ii^ denotes the squared error term and 0, denotes the regularization 
term, and each Pj is a d„-dimensional vector. Let S denote the true set of 
variables {j : fj 7^ 0}, with s = \S\, and let 5*^ denote its complement. Let 
Sn = {j ■ /3j / 0} denote the estimated set of variables from the minimizer 
Pn of ([T7|) . For the results in this section, we will treat the covariates as 
fixed. 



Theorem 2. Suppose that the following conditions hold on the design 
matrix X in the orthogonal basis tp: 



(48) 

(49) 
(50) 



A 



1 

max 1 -'^s'^S 

n 



A 



mm 1 ~ 

n 



Ms] > C„ 



< 00 



> 



max 



< 



Cry 



-, for some < 5 < 1. 



A.ssume that the truncation dimension d^ satisfies dn — > 00 and dn = o{n) . 
Furthermore, suppose the following conditions, which relate the regulariza- 
tion parameter A„ to the design parameters n,p, the number of relevant 
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variables s, and the truncation size dn: 

s 



(51) 
(52) 

(53) 



dnlog {dn{p - s)) 




log{sdn) , S^/^ , , 
1 ; r Ar. 



n 



dn 



' sdn 



where = mirijg^ ||oo- Then SpAM is sparsistent: P {^Sn = 5^ ^ 1. 
The proof is given in an appendix. Note that condition (|50p imphes that 



(54) 
(55) 
since 



max 



< 



' Cmin dfi 



(1-5) 



l^ll 



y Cmax 

for an TTi X n matrix A. This relates the 



< ll^ll < y/m\\A 

condition to previous oo-norm inc oherence condition s that have been used 
for sparsistency in the hnear case (IWainwrightJ . l2nn()l ). 

For 1/ = 2 we take d„ = n^/^, which achieves the minimax error rate in 
the one-dimensional case. The theorem under this design setting, with the 
simplifying assumption that s = 0(1), gives the following 

Corollary 1. Suppose that s = 0(1), and dn = 0{n^/^). Assume the 
design conditions (I48p . (I49p and (j50p . Suppose the penalty A„ is chosen to 
satisfy 



(56) 



n 



"'\n 



OO, 



log(?7,p) 

n4/5A2 



0, 



0. 



Then¥{Sn = 5) ^ 1. 



For example, under these conditions and p* x 1, the dimension pn can be 



3/5 

taken as large suitable choice for the regularization parameter in 

this case would be = Cn~To+^logn for some 5 > 0. 
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7.2. Persistence. The previous assumptions are very strong. They can 
be weakened at the expense of getting weaker results. In particular, in 
the section we do not assume that the true regression funct i on is addi- 
tive. We use arguments like those in Juditskv and Nemirovski ( 200d ) and 
Greenshtein and Ritov ( 20041 ) in the context of linear models. In this section 
we treat X as random and we use triangular array asymptotics, that is, the 
joint distribution for the data can change with n. Let {X, Y) denote a new 
pair (independent of the observed data) and define the predictive risk when 
predicting Y with v{X) by 



(57) 



R{v) =E(Y-v{X)f 



When v{x) = J2j PjQji^j) also write t he risk as R(l3, q) wher e = 
(/?!, . . . , Pp) and g = {gi, . . . , gp). Following iGreenshtein and Ritovl (j2004l ) 
we say that an estimator m„ is persistent (risk consistent) relative to a class 
of functions Ain, if 



(58) 



R{mn) - R{ml) ^ 



where 



(59) 



m* = argminii(t') 



is the predictive oracle. Greenshtein and Ritovl (2004) show that the lasso is 
persistent for Mn = {^{x) = [3 : \\(3\\-^ < L„} and L„ = o((n/ log n)-^/^). 
Note that m* is the best linear approximation (in prediction risk) in 
but the true regression function is not assumed to be linear. Here we show 
a similar result for SpAM. 

In this section, we assume that the SpAM estimator fhn is chosen to 



minimize 



(60) 



-Y^{Y,-Y^P,g,{X,,)f 



subject to WPWi < Ln and gj G Tj. We make no assumptions about the 

design matrix. Let M.n = -A^n(-^n) be defined by 

(61) 

Mn=\m: m{x) = Y.P,g,{x,) : E{gj) = 0, E{g]) = 1, ^ < lA 
I j=i j ) 



and let m* = argmin^g^^ R{v). 
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Theorem 3. Suppose that pn < e"* for some ^ < 1. Then, 
(62) R{mn) - R{ml) = Op (^^(^It^) 

and hence , if Ln = o{n^^~^^/^) then SpAM is persistent. 

8. Discussion. The results presented here show how many of the re- 
cently established theoretical properties of regularization for linear models 
extend to sparse additive models. The sparse backfitting algorithm we have 
derived is attractive because it decouples smoothing and sparsity, and can be 
used with any nonparametric smoother. It thus inherits the nice properties 
of the original backfitting procedure. However, our theoretical analyses have 
made use of a particular form of smoothing, using a truncated orthogonal 
basis. An important problem is thus to extend the theory to cover more 
general classes of smoothing operators. 

An additional direction for future work is to develop procedures for auto- 
matic bandwidth selection in each dimension. We have used plug-in band- 
widths and truncation dimensions dn in our experiments and theory. It is of 
particular interest to develop procedures that are adaptive to different levels 
of smoothness in different dimensions. 

Finally, we note that while we have considered basic additive models that 
allow functions of individual variables, it is natural to consider interactions, 
as in the functional ANOVA model. One challenge is to formulate suitable 
incoherence conditions on the functions that enable regularization based 
procedures or greedy algorithms to recover the correct interaction graph. 



In the parametric setting, one result in this direction is IWainwright et al. 
(|2007l '). 



9. Proofs. 

Proof of Theorem [H Consider the minimization of the Lagrangian 
(63) min £(/, A) ^ (y - f^^^^)f + ^Y. ^nf]{X,)) 



i=i 



with respect to fj G TCj, holding the other components {fk, k ^ j} fixed. The 
stationary condition is obtained by setting to zero the Frechet directional 
derivative with respect to fj, denoted dC{f, \;r]j), for all feasible directions 
VjiXj) G TCj (E(??j) = 0, E(?7|) < oo). This leads to the condition 

(64) dC{f, A; r?) = [(/,■ - R, + Xv,) r^,] = 
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where Rj = Y — J2ky^j fk is the residual for fj, and Vj is an element of the 



subgradient d^E{ff), satisfying vj E Hj; vj = fj/^E{fj)^ if IE(//) / and 

belonging to the set {uj G 7ij\ ]E(ti|) < 1} otherwise. 

Using iterated expectations, the above condition can be rewritten as 



(65) 



E[{f, + Xv,-E{Rj\X,))r^j]=0. 



But since fj — E{Rj\Xj) + Xvj G 7ij, we can compute the derivative in the 
direction rjj = fj — E{Rj\Xj) + Xvj G 7ij, implying that 



E 



{f,{x,) - E{R,\X, = Xj) + Xvjixj)y 



0; 



(66) 
that is, 

(67) fj + Xvj =E{Rj\Xj) a.e. 

Denote the conditional expectation E{Rj\Xj) — also the projection of the 



residual Rj onto Tij—hy Pj. Now if E(//) / 0, then Vj 



from condition ()67p implies 
(68) 

(69) 



, which 



E{Pf) 



IE 



1 + 



/. + A/,/v/lE(// 
A 



(70) 
(71) 



K(/|) + A 



> A. 



If E(//) = 0, then fj = a.e., and jE{v]) <1. §7!) implies that 



(72) ^EiPf) = X^Eiv]) < A. 

We thus obtain the equivalence 
(73) 



E{Pf) < A <^ fj = a.s. 
Rewriting equation ([67|) in light of ([73]). we obtain 

A 



1 + 



fj = Pj if ^E{Pf) > X 
fj = otherwise. 
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Using ([TO]) , we thus arrive at the soft thresholding update for fj: 

A 



(74) 



f, 



1 



E{Pf) 



where [•]_!_ denotes the positive part and Pj = ¥,[Rj \ Xj]. □ 

Proof of Theorem [2l A vector /3 G R^"P is an optimum of the objective 
function in (j46p if and only if there exists a subgradient g € dVl{j3), such 
that 



(75) 



n 



0. 



The subdifferential dQ{P) is the set of vectors g G M^*^" satisfying 

= » ^ if /5 . / 



n 



Our argument closely follows the approach of Wainwright ( 20061 ) in the 
linear case. In particular, we proceed by a "witness" proof technique, to show 
the existence of a coefficient-subgradient pair {P,g) for which supp(/3) = 
supp(/?*). To do so, we first set Ps" = and gs = dCl{(3*)s, and we then 
obtain Ps and 55c from the stationary conditions in (|75|) . By showing that, 
with high probability. 



9j 



n 



Pj / for j G 5 



gj < 1 for j G 5^ 



this demonstrates that with high probability there exists an optimal solution 
to the optimization problem in (|46p that has the same sparsity pattern as 
the true model. 

Setting Ps" = and 



(76) 



9j 
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for j £ S, the stationary condition for f3s is 



(77) 



-^-5 i^sPs -Y) + Xnds = 0. 



Let V = Y — ^sPs ~ ^ denote the error due to finite truncation of the 
orthogonal basis, where W = (ei, . . . , e„)"^. Then the stationary condition 
can be written as 



(78) 



or 



n 



1 

n 



n 



(79) f3s-(3*s=(-^^^s) { + -^^sV - ^ngs 



1 



n 



n 



assuming that ^^'^^5 is nonsingular. RecalHng our definition 

(80) p; = min||/3*||oo >0. 

it suffices to show tliat 

(81) \\Ps-PMoo<^ 

in order to ensure that supp(/JJ) = supp(/35') = {j : \\Pj\\oo / 0}. 

Using Tjss = ^ (^5^s) to simphfy notation, we have the £00 bound 



(82) 
(83) 



- HSWoo 
-1/1 



ss 



< 



+ 



^sl{hny)\\^+>^n\\^sl9s 



We now proceed to bound the quantities above. First note that for j G S, 



(84) 
(85) 



9j 



> 



1 



Cn 



and thus \\gj\\ < VC'max- Therefore, since 
(86) 

we have that 



gsWoo = rnaxllsfjlloo < max||gj||2 < vCn 
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Now, to bound ^^5!^ , first note that, as we are working over the 



Sobolev spaces Sj of order two. 



\V,.\ 



jeS k=d„+l 



jeSk=d„+l 



jeSk=d„+l 

(90) < sBC 



\ fc=d„+l 



A;4 - ,3/2 



sB' 



\ k = dn- 



^+1 



y - 



for some constant B' > Therefore, 



(91) 

and also 
(92) 



\V\ 



< 



B's 

^3/2' 







1 




< 








n 



WW <^ 

\V\\oo^ 3/2 



where D denotes a generic constant. Together then, we have that 

\\PS-P*s\\oo < 

(93) 



-1 / 1 



^ss 



n J 



+ 



^SS 



Ds 
IP 



Finally, consider Z := S^^ (^^^VFj . Note that ~ N{0, a^I), so that Z 
is Gaussian as weh, with mean zero. Consider its l-th. component, Zi = ej Z. 
Then E[Zi] = 0, and 



(94) 



Var(Z/) = — ejT,ggei < 



n 



C'min^ 



By Gaussian comparison results ( Ledoux and Talagrandl . Il99ll ). we have 
then that 



(95) mZWoo] < 3./log(sfi„)||Var(Z)||^ < Say 



'log(sd„) 



nC„ 
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An application of Markov's inequality then gives that 



35-/3SI|oo>^) < 



(96) 



■2' oo + 



■'SS 



(97) < -{e[||Z||oo] + 

Pn 



(98) 



< — <3C71 

Pr, 



'log{sdn) 



nC„ 



'SS 



+ 



^ss 



(m-3/2 + a„Vc^)} 

/ Ds 



which converges to zero under the condition that 
(99) — <W + 



1 

Pn 



n 



n 



3/2 



0. 



Now, since 
(100) 

condition (1991) holds in case 



< 



^ sd„ 



Cn 



(101) 



1 

P*n 



h —J h XnV Sdn 



0, 



which is condition (j53p in the statement of the theorem. 

We now analyze ^^c. Recall that we have set Pgc = (3%,^ = 0. The station- 
ary condition for j £ is thus given by 



(102) 

Therefore, 
gs'^ = 

(103) = 

(104) = 



i*} (I'sfe - ts/JJ - IV - V) + A„9j = 0. 



An n 



1 



-1 



1 



( -*5^S ) ( \ngs - -"^IW - -^IV 



1 



n 



n 



n J 

^ [\ngs - ^^S^ - ^^SV^) + ^*5= (VF + l^)| 
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from equation (|79]) . 
We require that 

(105) 

for all j £ S"^. Since 
(106) 



n 



9j < 1 



T 
9j 



n 



-1 



9i 



■Ib-ll^ 



it suffices to show that max^gsc ||(7j|| < \JC min- 

From (|lU4p . we see that gj is Gaussian, with mean 



(107)/x, =E(g,) = Sj5SsH55 



1 fl 



1 /I 



A„ Vn 



We then obtain the bound 

ll^ill < 

(108) 



\9s\\ + T- 



n * 



1 

An 



n J 



VsCmax + T~ 



n J 



1 

An 



By our earlier calculations, we have that 



(109) 

(110) 

Therefore 

(111) 



1 ,T,Tt 



\ n J II — 



1,T,T 



< 



dr, 



+ 



Now suppose that 



(112) 
(113) 



s 



< 



Cmiri 1 — 5 



for some 5 > 



0, 



which are conditions (|50p and (j51|) in the statement of the theorem. Then 

2Ds 



(114) 



ll^jll < \/C~{l-5) + 
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and in particular ||^j|| < i/Cmin for sufficiently large n. It therefore suffices 
to show that 

(115) P(max||^,-/.,|U>^)^0 
since this implies that 

(116) ||5ill < ll/^jll + -/^ill 

(117) < ll^jll + V^IISj - Mjlloo 



(118) < VC~{l-5) + - + o{l) 

with probability approaching one. To show (I115p . we again appeal to Gaus- 
sian comparison results. Define 

T fr ,T, _/,T,T,T, \-l,T,T\ ^ 



(119) Zj = v&j ^^I-^^,s{^i.^sr''fs) — 



for j G 5^^. Then Zj are zero mean Gaussian random variables, and we need 
to show that 



/ II V'. 

(120) P max " > 



oo 



A calculation shows that E(Z?^.) < jn. Therefore, we have by Markov's 
inequality and Gaussian comparison that 

fmaxHdk>^) < ^Efmax|Z,,| 



(121) < [^^\og{{p - s)dn) max [z%) 

~ 6Xn\ n 
which converges to zero under the condition that 

d„,log((p - s)dn) 
This is condition (1521) in the statement of the theorem. □ 
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Proof of Theorem [3l We begin with some notation. If is a class of 
functions then the L^o bracketing number N^^{e,M) is defined as the small- 
est number of pairs B = {{ii, ui), . . . , {ik,Uk)} such that \\uj — < e, 
^ ^ j ^ k, and such that for every m € A4 there exists {i, u) £ B such that 
£ < m < u. For the Sobolev space Tj, 

(124) logiV[](e,7;) <K' 



for some K > 0. The bracketing integral is defined to be 

(125) J[](5,7W) = ^log N[]{u,M)du. 
From Corollary 19.35 of van der Vaart ( 19981 ). 

(126) EfBup|,lb)-„b)l)<^«=^ 

for some C > 0, where F{x) = sup^g^^j 15(2;) 1 1 f^io) = ^{di^)) V-id) = 
Set Z = (Zq, . . . , Zp) = {Y,Xi,..., Xp) and note that 

(127) Ri(3,g) = E E PjPkn9j{Zj)gk{Zk)) 

j=0 k=0 

where we define 50 (■^o) = zq and (3o = —1. Also define 

-y n p p 

(128) R{P,g) = - E E E f3J(3kgJ{Z^J)gk{Z^k)■ 

1=1 j=Ok=0 

Hence m„ is the minimizer of R{P, g) subject to the constraint Ylj Pj9ji^j) ^ 
MniLn) and gj £ Tj. For ah {P,g), 

(129) \R{P,g) - R{P,g)\ < WPWl max sup \fijk{g) - Hki9)\ 

where fijkig) = n"^ XliLi Y.jk9j{Zij)gk{Zik) and iijkig) = ¥.{gj{Zj)gk{Zk)). 
From (fT2i]) it follows that 



(130) logiVn(e,7W„) < 21ogp„ + i^ 



1\V2 
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Hence, J[](C, = 0{\/log Pn) and it follows from ()126p and Markov's 

inequality that 

(131) 

max sup Ifljkig) - I^jki9)\ = Op I W i^^^ 1 = Op ( ^ 
We conclude that 



n 



(132) sup \R{g) - R{g)\ = Op 

Therefore, 



R{m*) < R{mn) <R{mn)+Op[^^ 



(l-5)/2 



n(i-?)/2 



and the conclusion follows. □ 
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